A Comparison of String Metrics for Matching Names and Records

نویسندگان

  • William W. Cohen
  • Pradeep Ravikumar
  • Stephen E. Fienberg
چکیده

We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We then describe an extension to the toolkit which allows records to be compared. We discuss some issues involved in performing a similar comparision for record-matching techniques, and finally present results for some baseline record-matching algorithms that aggregate string comparisons between fields.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Comparison of String Distance Metrics for Name-Matching Tasks

Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators , token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid s...

متن کامل

Private Record Linkage: Comparison of Selected Techniques for Name Matching

Grzebala, Pawel. M.S.C.E. Department of Computer Science and Engineering, Wright State University, 2016. Private Record Linkage: A Comparison of Selected Techniques for Name Matching. The rise of Big Data Analytics has shown the utility of analyzing all aspects of a problem by bringing together disparate data sets. Efficient and accurate private record linkage algorithms are necessary to achiev...

متن کامل

A String Metric for Ontology Alignment

Ontologies are today a key part of every knowledge based system. They provide a source of shared and precisely defined terms, resulting in system interoperability by knowledge sharing and reuse. Unfortunately, the variety of ways that a domain can be conceptualized results in the creation of different ontologies with contradicting or overlapping parts. For this reason ontologies need to be brou...

متن کامل

A Generalization of the Winkler Extension and its Application for Ontology Mapping

Mapping ontologies is a crucial process when facilitating system interoperability and information exchange. Ontology Mapping systems commonly utilize string metrics in the mapping process to compare concept names. String metrics can be extended using the Winkler method, which increases the similarity value of two strings if these have a common prefix. A common occurrence for two corresponding o...

متن کامل

Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching

Personal names are important and common information in many data sources, ranging from social networks and news articles to patient records and scientific documents. They are often used as queries for retrieving records and also as key information for linking documents from multiple sources. Matching personal names can be challenging due to variations in spelling and various formatting of names...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003